 data group


AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

arXiv.org Artificial Intelligence

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.


Machine Learning Optimal Ordering in Global Routing Problems in Semiconductors

arXiv.org Artificial Intelligence

In this work, we propose a new method for ordering nets during the process of layer assignment in global routing problems. The global routing problems that we focus on in this work are based on routing problems that occur in the design of substrates in multilayered semiconductor packages. The proposed new method is based on machine learning techniques and we show that the proposed method supersedes conventional net ordering techniques based on heuristic score functions. We perform global routing experiments in multilayered semiconductor package environments in order to illustrate that the routing order based on our new proposed technique outperforms previous methods based on heuristics. Our approach of using machine learning for global routing targets specifically the net ordering step which we show in this work can be significantly improved by deep learning.


Aioli: A Unified Optimization Framework for Language Model Data Mixing

arXiv.org Machine Learning

Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law -- an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.01 test perplexity points.


Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

arXiv.org Artificial Intelligence

The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As these assistants become universal, they are increasingly expected to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.


Variance reduced Shapley value estimation for trustworthy data valuation

arXiv.org Artificial Intelligence

The emerging big data in all walks of life has become the driving force of technological and economic development (Ghorbani and Zou, 2019; Huang et al., 2021). Various sectors such as finance and healthcare increasingly rely on individuals' data for predictions, decision-making, and generating business value, which promotes extensive data transactions (Barua et al., 2012). One of the most critical problems in data trading scenarios is data valuation. We consider data trading scenarios in data markets based on machine learning models, such as DATABRIGHT (Dao et al., 2018) and Sterling (Hynes et al., 2018). The data value in this scenario is largely determined by its contribution to a specific machine learning model. We focus on data valuation in supervised learning, which is one of the main pillars of machine learning. The core challenge is how to fairly evaluate the contribution of each data point in the training set to the learning algorithm for a particular performance metric. A natural way to handle this issue is to treat each data point as a player in a cooperative game. Then, the value of each player can be assessed through utility functions from a game-theoretic perspective (Jia et al., 2019b).
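The cooperative-game view above can be illustrated with a minimal sketch of plain Monte Carlo permutation sampling for Shapley values. This is the generic baseline estimator, not the variance-reduced estimator the paper's title refers to. The names `monte_carlo_shapley`, `toy_utility`, and `toy_points` are hypothetical, and the additive utility is a stand-in chosen so the result is easy to check by hand; in a real setting the utility would be a model's performance metric after training on the coalition.

```python
import random


def monte_carlo_shapley(points, utility, num_permutations=200, seed=0):
    """Estimate each point's Shapley value by averaging its marginal
    contribution over random orderings of the training set."""
    rng = random.Random(seed)
    n = len(points)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        previous = utility(coalition)
        for idx in order:
            # Marginal contribution of point idx when it joins the coalition.
            coalition.append(points[idx])
            current = utility(coalition)
            values[idx] += current - previous
            previous = current
    return [v / num_permutations for v in values]


def toy_utility(coalition):
    # Hypothetical additive utility: each point contributes its own score.
    return float(sum(coalition))


toy_points = [1.0, 2.0, 3.0]
shapley_estimates = monte_carlo_shapley(toy_points, toy_utility)
```

Because the toy utility is additive, every marginal contribution equals the point's own score, so the estimates coincide with `toy_points`. With a non-additive utility the estimates vary with the sampled permutations, which is the variance that estimators of this kind aim to reduce.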


An Adaptive Simulated Annealing-Based Machine Learning Approach for Developing an E-Triage Tool for Hospital Emergency Operations

arXiv.org Artificial Intelligence

Patient triage at emergency departments (EDs) is necessary to prioritize care for patients with critical and time-sensitive conditions. Different tools are used for patient triage and one of the most common ones is the emergency severity index (ESI), which has a scale of five levels, where level 1 is the most urgent and level 5 is the least urgent. This paper proposes a framework for utilizing machine learning to develop an e-triage tool that can be used at EDs. A large retrospective dataset of ED patient visits is obtained from the electronic health record of a healthcare provider in the Midwest of the US for three years. However, the main challenge of using machine learning algorithms is that most of them have many parameters and without optimizing these parameters, developing a high-performance model is not possible. This paper proposes an approach to optimize the hyperparameters of machine learning. The metaheuristic optimization algorithms simulated annealing (SA) and adaptive simulated annealing (ASA) are proposed to optimize the parameters of extreme gradient boosting (XGB) and categorical boosting (CaB). The newly proposed algorithms are SA-XGB, ASA-XGB, SA-CaB, and ASA-CaB. Grid search (GS), which is a traditional approach used for machine learning fine-tuning, is also used to fine-tune the parameters of XGB and CaB; the resulting algorithms are named GS-XGB and GS-CaB. The six algorithms are trained and tested using eight data groups obtained from the feature selection phase. The results show ASA-CaB outperformed all the proposed algorithms with accuracy, precision, recall, and F1 of 83.3%, 83.2%, 83.3%, 83.2%, respectively.
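The idea of tuning hyperparameters with simulated annealing can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's SA or ASA procedure: the geometric cooling schedule, the parameter ranges, and the names `simulated_annealing`, `toy_validation_score`, and `perturb` are all hypothetical, and the synthetic score stands in for a real validation metric from a boosted-tree model.

```python
import math
import random


def simulated_annealing(objective, initial, neighbor,
                        steps=500, t_start=1.0, cooling=0.99, seed=0):
    """Maximize objective(params) with a basic simulated annealing loop."""
    rng = random.Random(seed)
    current, current_score = initial, objective(initial)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = objective(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse candidates with
        # probability exp(delta / T), which shrinks as T cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling  # assumed geometric cooling schedule
    return best, best_score


def toy_validation_score(params):
    # Hypothetical stand-in for a validation metric; peaks at
    # max_depth=6, learning_rate=0.1.
    return (-((params["max_depth"] - 6) ** 2)
            - 100.0 * (params["learning_rate"] - 0.1) ** 2)


def perturb(params, rng):
    # Propose a nearby configuration, clamped to assumed valid ranges.
    depth = min(12, max(1, params["max_depth"] + rng.choice([-1, 1])))
    rate = min(1.0, max(0.001, params["learning_rate"] * rng.uniform(0.8, 1.25)))
    return {"max_depth": depth, "learning_rate": rate}


start = {"max_depth": 2, "learning_rate": 0.5}
best_params, best_score = simulated_annealing(toy_validation_score, start, perturb)
```

Unlike grid search, which evaluates a fixed lattice of configurations, this loop spends its evaluation budget near configurations that have already scored well while still occasionally accepting worse ones to escape local optima.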


A Study of Left Before Treatment Complete Emergency Department Patients: An Optimized Explanatory Machine Learning Framework

arXiv.org Artificial Intelligence

The issue of left before treatment complete (LBTC) patients is common in emergency departments (EDs). This issue represents a medico-legal risk and may cause a revenue loss. Thus, understanding the factors that cause patients to leave before treatment is complete is vital to mitigate and potentially eliminate these adverse effects. This paper proposes a framework for studying the factors that affect LBTC outcomes in EDs. The framework integrates machine learning, metaheuristic optimization, and model interpretation techniques. Metaheuristic optimization is used for hyperparameter optimization--one of the main challenges of machine learning model development. Three metaheuristic optimization algorithms are employed for optimizing the parameters of extreme gradient boosting (XGB): simulated annealing (SA), adaptive simulated annealing (ASA), and adaptive tabu simulated annealing (ATSA). The optimized XGB models are used to predict the LBTC outcomes for the patients under treatment in ED. The designed algorithms are trained and tested using four data groups resulting from the feature selection phase. The model with the best predictive performance is interpreted using the SHapley Additive exPlanations (SHAP) method. The findings show that ATSA-XGB outperformed other model configurations with an accuracy, area under the curve (AUC), sensitivity, specificity, and F1-score of 86.61%, 87.50%, 85.71%, 87.51%, and 86.60%, respectively. The degree and the direction of effects of each feature were determined and explained using the SHAP method.


Graph Based Multi-layer K-means++ (G-MLKM) for Sensory Pattern Analysis in Constrained Spaces

arXiv.org Machine Learning

In this paper, we focus on developing a novel unsupervised machine learning algorithm, named graph based multi-layer k-means++ (G-MLKM), to solve the data-target association problem when targets move on a constrained space and minimal information of the targets can be obtained by sensors. Instead of employing the traditional data-target association methods that are based on statistical probabilities, the G-MLKM solves the problem via data clustering. We first develop the Multi-layer K-means++ (MLKM) method for data-target association at local space given a simplified constrained space situation. Then a p-dual graph is proposed to represent the general constrained space when local spaces are interconnected. Based on the dual graph and graph theory, we then generalize MLKM to G-MLKM by first understanding local data-target association and then extracting cross-local data-target association by mathematically analyzing the data association at intersections of that space. To exclude potential data-target association errors that disobey physical rules, we also develop error correction mechanisms to further improve the accuracy. Numerous simulation examples are conducted to demonstrate the performance of G-MLKM.


The Anatomy of K-Means Clustering

#artificialintelligence

Let's say you want to classify hundreds (or thousands) of documents based on their content and topics, or you wish to group together different images for some reason. Or, going a step further, suppose you already have that same data classified but you want to challenge that labeling. You want to know whether that data categorization makes sense or can be improved. Well, my advice is that you cluster your data. Information is often obscured by noise and redundancy, and grouping data into clusters (clustering) with similar features is an efficient way to shed some light on it.
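To make the clustering idea concrete, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to seed the centroids with the first k points are assumptions for illustration; real use would more likely rely on a library implementation with a better initialization such as k-means++.

```python
def kmeans(points, k, max_iters=100):
    """Cluster points (tuples of numbers) into k groups with Lloyd's algorithm."""
    # Simplistic initialization (an assumption): use the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break  # converged: nothing changed in this iteration
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Two visibly separated groups of 2-D points.
toy_data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(toy_data, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, even though both starting centroids come from the first group.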


Data Engineer – Machine Learning

#artificialintelligence

H5's Data Group is looking to add a Data Science Engineer to support legal electronic discovery projects. The ideal candidate will draw on his or her broad technical experience to address complex data needs by providing analytic insights, creating and executing technical solutions, and taking responsibility for projects' data-related needs. The Data Group's priorities are balanced between executing fast-moving projects full of intriguing data, responding to immediate requests, and proactively designing tools for emerging needs. This position is a contract position for the duration of 6 months, with possibility of changing to a full time position, contingent on strong performance by the candidate and also business needs.

M.S. or Ph.D. in Computer Science, Machine Learning or NLP
2 years of industry experience at minimum
Strong coding and debugging skills in Python
Strong working knowledge of machine learning techniques
Experience applying machine learning techniques to NLP problems
Knowledge of fundamental natural language processing techniques
Experience working with Spark and large data sets preferred
Experience using SQL for data insight and manipulation
Experience with Linux

H5 is an Equal Opportunity Employer. You can apply to this job and others using your online resume.